Conversation

@camille-bouvy-frequenz (Contributor) commented Nov 21, 2024

This PR introduces a backoff module that wraps function calls with a configurable timeout, which increases with every timed-out call. It is then used for all non-streaming gRPC-related methods of the Client. The default timeout is set to 300 seconds (i.e. 5 min), with flexibility to adjust as needed.
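A minimal sketch of such a wrapper, assuming plain asyncio (the timeout-growth logic between retries is omitted here for brevity; only the basic timeout behaviour is shown):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def grpc_call_with_timeout(
    call: Callable[..., Awaitable[Any]],
    *args: Any,
    timeout: float = 300.0,  # default of 300 s (5 min), as described above
    **kwargs: Any,
) -> Any:
    """Run a non-streaming gRPC call, cancelling it if the deadline passes."""
    # asyncio.wait_for raises asyncio.TimeoutError on expiry, so the caller
    # gets an error instead of a call that hangs indefinitely.
    return await asyncio.wait_for(call(*args, **kwargs), timeout=timeout)
```

A timed-out call then surfaces as `asyncio.TimeoutError` rather than blocking the caller.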

@camille-bouvy-frequenz camille-bouvy-frequenz requested a review from a team as a code owner November 21, 2024 14:09
@github-actions github-actions bot added the part:docs Affects the documentation label Nov 21, 2024
@camille-bouvy-frequenz (Contributor Author)

This timeout is now increased dynamically with each failed connection attempt.

@tiyash-basu-frequenz left a comment

It looks like the grpc_call_with_timeout is invoked per call. Maybe I am wrong, but wasn't the idea to implement the backoff at the client level, and not at a call level?

If we indeed need to implement it at a client level, then we could send the commands via a channel to a central client task, that can impose a global backoff logic. It would also need logic to reset the backoff after a certain period passes since the last successful RPC call.
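The channel-based suggestion above might look something like this hypothetical sketch (names are illustrative, not an existing API; for brevity the backoff resets immediately on success rather than after a quiet period since the last success):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


class CentralClientTask:
    """Funnel all RPCs through one task so backoff state is global, not per call."""

    def __init__(self, initial_delay: float = 1.0, max_delay: float = 60.0):
        self._queue: asyncio.Queue = asyncio.Queue()
        self._initial_delay = initial_delay
        self._max_delay = max_delay
        self._delay = 0.0  # current global backoff; 0 means no backoff

    async def submit(self, call: Callable[[], Awaitable[Any]]) -> Any:
        """Send a command to the central task and wait for its result."""
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._queue.put((call, future))
        return await future

    async def run(self) -> None:
        """Process queued RPCs sequentially, imposing a global backoff."""
        while True:
            call, future = await self._queue.get()
            if self._delay > 0.0:
                await asyncio.sleep(self._delay)  # every caller waits it out
            try:
                result = await call()
            except Exception as exc:  # sketch only: real code would be selective
                # Double the backoff on failure, capped at max_delay.
                self._delay = min(
                    max(self._initial_delay, self._delay * 2), self._max_delay
                )
                future.set_exception(exc)
            else:
                self._delay = 0.0  # success: clear the backoff
                future.set_result(result)
```

Because every call passes through `run()`, a fresh call cannot bypass a backoff triggered by an earlier failed one.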

@camille-bouvy-frequenz (Contributor Author)

Hmm, fair point that the timeout could be increased at the client level. That said, some gRPCs inherently take much longer than others (e.g., list_gridpool_orders), which is why I was thinking of having this handled at the gRPC level. However, I’m definitely not opposed to managing it at the client level either.

For me, the most important thing is that we have a timeout in place (whether it’s dynamically decided or fixed) to avoid hanging calls.

@tiyash-basu-frequenz

I see. True, different RPCs having different timeouts needs to be considered.

The main objective of timeouts would be to reduce stress on the client's connection. Having per-call timeouts violates this principle: one call might be backing off while a new call goes straight through. So my guess is that while you may see some benefits of this approach right away, it would be a matter of chance when the connection to the server gets stressed. The more client apps there are, the worse this will get.

@camille-bouvy-frequenz force-pushed the add-timout-requests branch 2 times, most recently from ff2e5ff to a0eb107 on November 28, 2024 16:43
@camille-bouvy-frequenz (Contributor Author)

I currently have a max_timeout (maximum timeout duration for gRPC calls), a timeout_increment (value added to the timeout on each retry, e.g. 20 s), and a max_timeout_retries (maximum number of retry attempts when a timeout is reached). Having those 3 vars could be a bit of overkill; I could remove one of them if needed.
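An illustrative sketch of how these three settings could interact (the parameter names follow the comment above; the starting timeout `initial_timeout` and the loop structure are assumptions, not the actual implementation):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def call_with_backoff(
    call: Callable[[], Awaitable[Any]],
    *,
    initial_timeout: float = 60.0,     # assumed starting value
    timeout_increment: float = 20.0,   # added after each timed-out attempt
    max_timeout: float = 300.0,        # cap on the per-attempt timeout
    max_timeout_retries: int = 3,      # attempts before giving up
) -> Any:
    timeout = initial_timeout
    for attempt in range(max_timeout_retries):
        try:
            return await asyncio.wait_for(call(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == max_timeout_retries - 1:
                raise  # out of retries: surface the timeout to the caller
            # Grow the timeout for the next attempt, but never past the cap.
            timeout = min(timeout + timeout_increment, max_timeout)
```

With both a cap and an increment, the number of useful retries is arguably implied, which is one way to drop one of the three settings.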

@tiyash-basu-frequenz left a comment

The addition itself looks good. I have a few improvement suggestions.

@camille-bouvy-frequenz camille-bouvy-frequenz changed the title Add timout to gRPC requests Add timeout to gRPC requests Dec 2, 2024
@tiyash-basu-frequenz left a comment

Looks very good! I would just suggest adding a few tests to nail down stability and prevent regressions, and then this is done.

@github-actions github-actions bot added the part:tests Affects the unit, integration and performance (benchmarks) tests label Dec 4, 2024
@shsms (Contributor) commented Dec 9, 2024

Sorry, I'm just realizing this has become a retry implementation for non-streaming calls.

Retries of this sort have to be infinite because we have to keep trying until the API call succeeds, and doing that in the client would take away the actor's agency in deciding what to do when there is a client error, like taking fallback action, sending a slack message, etc.

It also introduces the risk of creating orders with outdated power or price forecasts if the client's retry succeeds after a few minutes.

For this reason, all of our clients should have retries only for streaming APIs (which are taken care of by the GrpcStreamBroadcaster in the base client).

For non-streaming APIs, retries can be implemented in the actors using the ExponentialBackoff provided by the base client. I'm also adding a receiver interface for this in the base client, allowing us to use a select loop to decide when to retry.

Also, I've implemented these in the actor for now in this PR: https://github.com/frequenz-io/frequenz-actor-electricity-trading/pull/249

So, I think this PR should go back to just implementing a timeout for non-streaming methods to prevent them from blocking for too long and limiting the actor's ability to take fallback action.

@tiyash-basu-frequenz

Then all that needs to be done is invoke only simple RPCs with the backoff, and call the streaming ones directly. If a backoff module exists in the base client, then that could be used directly (and sorry that you implemented it here @camille-bouvy-frequenz).

I also would not recommend going back to a non-global backoff, since I do not see a value in that.

@shsms (Contributor) commented Dec 10, 2024

> Then all that needs to be done is invoke only simple RPCs with the backoff

Not in the client, for the reasons mentioned above. We just need timeout to be implemented in this PR.

> I also would not recommend going back to a non-global backoff, since I do not see a value in that.

Top-level retry is provided by the actor implementation, which gives the most flexibility. There are also plans to add additional backoff mechanisms, failure detection based on run times, etc., and then it is needed only in one place. If users need additional control, they can do it in their actors.

@camille-bouvy-frequenz force-pushed the add-timout-requests branch 2 times, most recently from 32305d4 to f5cdbe7 on December 10, 2024 13:34


async def grpc_call_with_timeout(
call: Callable[..., Awaitable[Any]], *args: Any, timeout: float = 300, **kwargs: Any
@shsms (Contributor) commented Dec 10, 2024
I think this should take a timedelta | None, and the default value should be None, here and in the other calls. Otherwise it wouldn't be intuitive, and it wouldn't be what you'd expect from a thin wrapper.

@camille-bouvy-frequenz (Contributor Author)

I agree about the timedelta, but I don't think the timeout should be None by default. Otherwise we risk the API getting stuck again (without raising any errors) like it did last week.

@shsms (Contributor)

Well, the actor would set it when making calls to the client. The actor can get the value to use from the config file.

The 300s comes from specific issues we noticed in the server and what we assume the actor needs. We can't say that 300 is the right value in the long term. It could become different even as soon as we go to production.

@camille-bouvy-frequenz (Contributor Author)

No, the 300 s was arbitrary, just to have some timeout on the API requests. And the issues didn't come from the actor's side but from the API side, when it disconnected (without throwing any errors). I don't mind changing the 300 s to another value, but I'd feel more comfortable having a timeout by default.

@shsms (Contributor)

> No the 300s was arbitrary just to have some timeout in the API requests.

This is just the thing. Defaults have to make sense, and we can't find a value that does, especially when we expect external parties to use it. Maybe we shouldn't add this here at all and keep it only in the actors? If, at some point, we get an SDK-like layer between the client and the actor, these problems would become much easier.

Maybe @llucax has an opinion? Because we try to be consistent with all the API clients.

In the worst case, we go ahead with 300s as the default if it works for most cases.

For example, the default receiver buffer size of 50 is a bit arbitrary and has bitten me sometimes. It didn't occur to me for a long time that it was overflowing and needed a bigger buffer. But still, having 50 as a default has been very useful in most places. The situation here is very different because None as buffer size would have meant no buffer, whereas here None as timeout means infinite timeout.

@camille-bouvy-frequenz (Contributor Author)

Yeah, OK, I get your point. I'll set the default to None for now, and just make sure we call the functions with timeouts in the actors.

@tiyash-basu-frequenz

> > Then all that needs to be done is invoke only simple RPCs with the backoff
>
> Not in the client, for the reasons mentioned above. We just need timeout to be implemented in this PR.
>
> > I also would not recommend going back to a non-global backoff, since I do not see a value in that.
>
> Top level retry is provided by the actor implementation. That gives the most flexibility. And there are plans to add additional backoff mechanisms and failure detection based on run times, etc. And then it is needed only in one place. If users need additional control, they can do it in their actors.

So if I understand it right, you specifically are suggesting having per-RPC backoffs in this client? If yes, how would that help?

Nice that a global backoff module exists in the actor library!

@shsms (Contributor) commented Dec 10, 2024

> you specifically are suggesting having per-RPC backoffs in this client?

No, I'm saying we shouldn't add automatic retry in the client for non-streaming RPCs. There is no question of backoff if there's no retry.

I'm only suggesting that we have timeout for such RPCs.

Signed-off-by: camille-bouvy-frequenz <[email protected]>
@shsms previously approved these changes Dec 11, 2024
@shsms (Contributor) left a comment

Just an optional comment.

Signed-off-by: camille-bouvy-frequenz <[email protected]>
@camille-bouvy-frequenz camille-bouvy-frequenz added this pull request to the merge queue Dec 11, 2024
Merged via the queue into frequenz-floss:v0.x.x with commit b8c5439 Dec 11, 2024
14 checks passed
@camille-bouvy-frequenz camille-bouvy-frequenz deleted the add-timout-requests branch December 11, 2024 14:10